feat(gdb): debug guest kernel of a restored microVM #19

Open
kalyazin wants to merge 10 commits into firecracker-v1.14-direct-mem from kalyazin/gdb-restore-path

@kalyazin

Why

Upstream wires gdb only into the boot path, so a snapshot-resumed microVM can't be
debugged. The stub also isn't all-stop, so multi-vCPU inspection hangs. We need to gdb
the guest kernel of a resumed snapshot (to investigate resume behavior and slow
envd-init) with full source-level symbols and KASLR on.

What (all gdb code behind --features gdb; default/prod build untouched)

  • start gdb on restore — accept a gdb_socket_path restore-time override on the
    load-snapshot request; build_microvm_from_snapshot wires attach_debug_info +
    gdb_thread at the restored RIP. Env fallback FIRECRACKER_GDB_SOCKET.
  • all-stop — pause sibling vCPUs on every stop (QEMU-style) so info threads /
    per-vCPU backtraces work.
  • drain stale debug events on resume — fixes a latent multi-vCPU race that dropped
    the gdb connection (GdbQueueError) under a breakpoint storm.
  • snapshot-editor: print saved MSRs — so the KASLR slide can be recovered from a
    snapshot (MSR_LSTAR).
  • integration test — restore (file/UFFD, 4K/2M hugetlb) on the prod DWARF kernel,
    recover the slide, hit a breakpoint, read kernel structs, enumerate vCPUs, and
    attribute guest page faults to process+VMA.

kalyazin and others added 5 commits June 12, 2026 17:24
memory_info.rs, pagemap.rs and meminfo.rs (added by the guest-memory
introspection API work) ship without the SPDX/Apache-2.0 header the
license style check (integration_tests/style/test_licenses.py) requires.
Prepend the standard two-line Amazon/Apache-2.0 header.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

test_balloon_wait_on_ack.py and test_drive_virtio.py are not
black-formatted, so the python style check fails. Reformat them with
`black --config tests/pyproject.toml`; no test-logic change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

docs/api_requests/block-write-zeroes.md and docs/ballooning.md are not
mdformat-clean, failing the markdown style check. Reformat with
mdformat; no content change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

Several fork files predating this branch are not rustfmt-clean under
tests/fmt.toml, failing the rust style check. Run `cargo fmt`; a
mechanical reformat only, no logic change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

Under `--features gdb`, clippy's mismatched_lifetime_syntaxes fires on
five FirecrackerTarget trait methods that take `&mut self` and return a
gdbstub `*Ops` type whose lifetime is elided in the path: the borrow is
visible on the receiver but hidden in the return type. Spell the
lifetime as `<'_, ...>` so the two syntaxes match. No behavior change;
makes `cargo clippy --features gdb` clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

cla-bot added the cla-signed label Jun 12, 2026
@cursor

cursor Bot commented Jun 12, 2026

PR Summary

Medium Risk
GDB support is feature-gated and the socket override applies only at restore, but the PR changes vCPU pause/resume and debug-event handling on multi-vCPU paths; default builds are unaffected.

Overview
Adds GDB guest-kernel debugging on snapshot restore (x86_64, --features gdb): load-snapshot accepts optional gdb_socket_path (API + swagger); restore applies it via MachineConfigUpdate and build_microvm_from_snapshot wires vCPU debug channels and starts the GDB server at the saved RIP. Boot path gains resolve_gdb_socket_path (machine-config or FIRECRACKER_GDB_SOCKET).

All-stop multi-vCPU behavior: pause_all_vcpus on initial attach, each stop, and Ctrl-C; resume_all_vcpus drains stale queued debug events before resume to avoid multi-vCPU GdbQueueError under breakpoint storms.

snapshot-editor prints per-MSR index/data from vcpu state (KASLR slide from MSR_LSTAR). New integration tests cover restore+GDB (file/UFFD, 4K/2M) and fault attribution; test harness passes gdb_socket_path and skips post-resume SSH when GDB holds the guest.

Minor: copyright headers, doc/markdown wrapping, import/format churn in persist modules and tests.

Reviewed by Cursor Bugbot for commit 9eeafcd.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 47dfd92.

Comment thread src/vmm/src/builder.rs
kalyazin and others added 5 commits June 12, 2026 18:31
Upstream wires gdb only into the boot path; restored microVMs never
started the gdb server. Accept a gdb_socket_path restore-time override
on the load-snapshot request (alongside network_overrides and
clock_realtime) and wire attach_debug_info + gdb_thread into
build_microvm_from_snapshot (x86_64), arming the entry breakpoint at the
restored vCPU RIP so gdb takes control at the resume point.

Carrying the socket on LoadSnapshotParams keeps it a pure restore-time
knob: no machine-config update is needed before the load (which would
forbid the snapshot load), and there is no boot-time value to preserve
across restore. persist sets the restored machine config's
gdb_socket_path from the load param, which the snapshot builder reads.
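
To make the restore-time override concrete, a load-snapshot request body could be built as in the sketch below. Only `gdb_socket_path` comes from this commit; `snapshot_path` and `mem_backend` are existing load-snapshot fields. The helper name and the paths are hypothetical, and placing the new field at the top level of the body is an assumption.

```python
import json


def build_load_snapshot_body(snapshot_path, mem_path, gdb_socket_path=None):
    """Build a JSON body for a load-snapshot request (hypothetical helper)."""
    body = {
        "snapshot_path": snapshot_path,
        "mem_backend": {"backend_type": "File", "backend_path": mem_path},
    }
    # The override is optional: omit the key entirely when gdb is not wanted,
    # so the request stays identical to a plain restore.
    if gdb_socket_path is not None:
        body["gdb_socket_path"] = gdb_socket_path
    return json.dumps(body)
```

Leaving the key out when no socket is given keeps the override a pure restore-time knob, as the commit describes.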

Also add resolve_gdb_socket_path() with a FIRECRACKER_GDB_SOCKET env
fallback, so launchers that cannot set the load request can still enable
gdb.
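
The fallback can be sketched as follows. This is a Python illustration of the idea, not the Rust implementation; the precedence (explicit value first, then `FIRECRACKER_GDB_SOCKET`) is inferred from the word "fallback", and treating an empty variable as unset is an assumption of the sketch.

```python
import os

GDB_SOCKET_ENV = "FIRECRACKER_GDB_SOCKET"


def resolve_gdb_socket_path(configured, env=None):
    """Return the gdb socket path to use, or None if gdb was not requested."""
    env = os.environ if env is None else env
    # An explicitly configured path (e.g. from the load request) wins.
    if configured:
        return configured
    # Otherwise fall back to the environment; an empty value counts as unset
    # (an assumption of this sketch).
    return env.get(GDB_SOCKET_ENV) or None
```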

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

When one vCPU stopped at a debug event the others kept running, so
querying a running vCPU (info threads, per-vCPU backtraces) blocked
indefinitely. Pause the sibling vCPUs on every stop (initial entry stop,
breakpoint stops, and Ctrl-C), like QEMU's all-stop, reusing the
existing per-vCPU pause (which kicks a running or halted vCPU out of
KVM_RUN).
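
The all-stop rule described above can be modeled in a few lines. This is a simplified sketch: `VcpuState`, the `states` map and the `pause_vcpu` callback are hypothetical stand-ins for the stub's bookkeeping and for the existing per-vCPU pause.

```python
from enum import Enum


class VcpuState(Enum):
    RUNNING = "running"
    PAUSED = "paused"


def pause_all_vcpus(states, pause_vcpu):
    """Pause every vCPU that is still running; return the ids paused now.

    `states` maps vCPU id -> VcpuState. `pause_vcpu` stands in for the
    per-vCPU pause primitive.
    """
    paused_now = []
    for vcpu_id, state in states.items():
        if state is VcpuState.RUNNING:
            pause_vcpu(vcpu_id)
            states[vcpu_id] = VcpuState.PAUSED
            paused_now.append(vcpu_id)
    return paused_now
```

Calling this on every stop means gdb only ever queries vCPUs that are parked, which is what lets `info threads` return instead of blocking.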

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

When more than one vcpu hits a breakpoint while the VM runs, each sends
a debug event and parks itself in the paused emulation state. The gdb
event loop reports the first and force-pauses the rest, but their
already-queued debug events are never consumed. On the next resume those
stale events remain, so a following `wait_for_stop_reason` dequeues one
and processes it against a vcpu that has since resumed: it marks a
running vcpu as paused, desyncing the pause/resume handshake until the
vcpu threads exit and the event channel disconnects — surfacing as a
fatal `GdbQueueError` ("Remote connection closed" on the client) under a
sustained multi-vcpu breakpoint storm.

Drain the debug-event queue at the start of `resume_all_vcpus`. Every
vcpu is paused there, so none can emit an event and anything queued is
provably stale; dropping it is safe and keeps `vcpu_state` in sync with
the vcpus.
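
A minimal model of the drain, using Python's `queue.Queue` in place of the Rust event channel. The function signatures and the `paused` set are assumptions made for illustration; only the ordering (drain first, then resume) reflects the commit.

```python
import queue


def drain_stale_events(events):
    """Discard everything currently queued; return how many were dropped."""
    dropped = 0
    while True:
        try:
            events.get_nowait()
        except queue.Empty:
            return dropped
        dropped += 1


def resume_all_vcpus(events, paused, resume_vcpu):
    """Drain stale debug events, then resume every paused vCPU."""
    # Every vCPU is paused here, so nothing new can be enqueued and anything
    # still in the queue refers to a stop that has already been handled.
    dropped = drain_stale_events(events)
    for vcpu_id in sorted(paused):
        resume_vcpu(vcpu_id)
    paused.clear()
    return dropped
```

Draining before the first resume matters: once any vCPU runs again it may enqueue a fresh event, and a later drain could no longer tell stale events from live ones.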

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

The derived Debug of a vcpu's saved_msrs shows only the kvm_msrs
headers, not the entries (a FAM array). Print each saved MSR's index and
data so tooling can read values from a snapshot — e.g. MSR_LSTAR
(entry_SYSCALL_64), used to recover the KASLR image slide of a restored
guest for source-level debugging.
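
The slide recovery reduces to one subtraction: the runtime address in MSR_LSTAR minus the link-time address of `entry_SYSCALL_64` from the unslid `vmlinux`. The sketch below assumes both values have already been obtained (parsing the snapshot-editor output is not shown, since its exact format is not given here); the function name is hypothetical, and the negative-slide check is a sanity check added for this sketch.

```python
def kaslr_slide(msr_lstar, linktime_entry_syscall_64):
    """Return the KASLR image slide: runtime address minus link-time address."""
    slide = msr_lstar - linktime_entry_syscall_64
    # A negative result most likely means the inputs are swapped or come from
    # different kernel builds, so fail loudly (a choice made for this sketch).
    if slide < 0:
        raise ValueError("MSR_LSTAR is below the link-time entry_SYSCALL_64")
    return slide
```

For example, with made-up addresses `0xffffffff9d000040` (MSR_LSTAR) and `0xffffffff82000040` (link-time symbol), the slide is `0x1b000000`.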

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

Two tests on the production kernel built with DWARF (KASLR on), passing
the gdb socket as a load-snapshot restore-time override. Both recover
the KASLR image slide from the snapshot (MSR_LSTAR via snapshot-editor
vs the link-time entry_SYSCALL_64) and attach gdb to the restored guest:

- test_gdb_restore: multi-vCPU, restore file- and UFFD-backed (4K and 2M
  hugetlb), hit a breakpoint, print kernel structures/memory, and
  enumerate both vCPUs (info threads) with a per-vCPU backtrace.
- test_gdb_restore_fault_attribution: attribute guest page faults to
  process+VMA (comm/pid/addr/VMA) by breaking handle_mm_fault on the
  restored multi-vCPU VM under a sustained fault storm. Doubles as a
  regression test for the stale debug-event drain on resume (two vCPUs
  hammering the breakpoint).
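
A gdb session along the lines these tests describe could be driven with a batch command line like the one built below. This is a sketch, not the tests' actual invocation: the helper name, its arguments and the chosen command sequence are assumptions. The gdb commands themselves (`symbol-file ... -o`, `target remote`, `break`, `info threads`, `thread apply all backtrace`) are standard.

```python
def gdb_restore_argv(vmlinux, socket_path, slide, break_at="handle_mm_fault"):
    """Build a gdb batch command line for a restored guest (hypothetical helper)."""
    return [
        "gdb", "-batch",
        # Load symbols shifted by the KASLR slide so they match the guest.
        "-ex", f"symbol-file {vmlinux} -o {slide:#x}",
        # Connect to the Unix domain socket the gdb server listens on.
        "-ex", f"target remote {socket_path}",
        "-ex", f"break {break_at}",
        "-ex", "continue",
        # With all-stop, every vCPU can be listed and backtraced at the stop.
        "-ex", "info threads",
        "-ex", "thread apply all backtrace",
    ]
```

The resulting list can be passed to `subprocess.run`; keeping it as an argument list avoids shell quoting issues with the socket path.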

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>

kalyazin force-pushed the kalyazin/gdb-restore-path branch from 47dfd92 to 9eeafcd on June 12, 2026 17:32
kalyazin marked this pull request as ready for review on June 15, 2026 14:27